Introduction

Our planet faces massive economic, social and environmental challenges. To combat these, the United Nations established a universal initiative known as the sustainable development goals (SDGs) that define global priorities and aspirations for 2030 in the form of 17 goals and 169 targets.Footnote 1 The goals represent a pathway to eliminate extreme poverty and aim to put the world on a sustainable path. Achieving the SDGs is not just a political commitment but an opportunity to establish a technology transition (United Nations, 2021). The post-2015 sustainable development agenda calls on all countries to enhance research, upgrade technological capabilities, encourage innovation and increase public and private investment (Giovannini et al. 2015).

Science, technology and innovation (STI, as referred to in the UN and OECD contexts) have been recognized as one of the main drivers behind productivity increases and a key long-term lever for economic growth and prosperity (Daejeon Declaration, 2015). STI is a fundamental tool for implementing the new agenda, as it improves efficiency in both the economy and society, develops new and more sustainable ways to satisfy human needs, and empowers people to drive their own future (Giovannini et al. 2015). In the SDG framework, STI features strongly in Goal 17 and also serves as a cross-cutting means to achieve several sectoral Goals and Targets. Fostering innovation is part of Goal 9 on resilient infrastructure and inclusive, sustainable industrialization, while Target 9.5 elevates the role of research and innovation policy well beyond STI as one of the means of implementation.

In the United Nations General Assembly briefing materials, the importance of science, technology and innovation (STI) for the SDGs has been repeatedly emphasized in the yearly forums.Footnote 2 As Marie Chatardová, President of the Economic and Social Council, stated at the 2018 New York STI Forum: “No one can ignore the vital role of science, technology and innovation (STI) in advancing the transformative impact”.Footnote 3 Similarly, at the concluding session of the Forum, the Technology Adviser to the US Secretary of State said that the integrated nature of the SDGs requires multi-disciplinary and holistic science, technology and innovation approaches that break silos and take into account different sources of knowledge.

Innovation in general, and innovation in the context of sustainable development in particular, affects many parts of human life and therefore warrants careful attention. The SDG framework is built on the expectation of technological development and innovation. Goal 9, for example, should be highlighted in the context of innovation: its stated objective is to “promote inclusive and sustainable industrialization and foster innovation”. STI is a democratizing tool for transferring science to society and can be instrumental in introducing solutions for achieving the SDGs. The importance of innovation in reaching sustainable development is also recognized by previous research, for example by Ashford and Hall (2011).

One of the main ways in which STI-oriented efforts are manifested is through scholarly literature and intellectual property protected in the form of patents. The ability to analyse science and technology output has increased tremendously in the past decade due to the increasing digitalization of research article and intellectual property databases (e.g., Web of Science, Scopus, PATSTAT, Google Patents). Deep technical analysis of patents’ and publications' textual content and bibliographic metadata provides valuable interpretation of even the most complex and technology-oriented artefacts. To a significant extent, previous literature still focuses on descriptive measures rather than creating in-depth data that offers additional vantage points for evaluating the SDG orientation of STI within overall innovation systems. This research aims to extend advances in the quantitative analysis of STI to enhance the detection of SDG instances in documented scientific and patented materials. Advanced text analytics tools combined with artificial intelligence techniques offer the opportunity to analyse science and technology development through publications on a large scale, with easy implementation, and in a harmonized manner over long time horizons.

This research focuses on analysing the contribution of science and technology literature to the SDGs using machine learning methods. Using limited training data derived from a lexical search query in a publication database, a machine learning model is built to identify the relevance of patents to the SDGs on a larger scale. Furthermore, the machine learning model is evaluated to estimate how far it extends the coverage of SDG-oriented artefacts in comparison to a standard lexical query-based search.

The paper continues with a background study and an outline of the research design. After that, the methodological approach is described, followed by a presentation of the results. Finally, the results are discussed in the last section, elaborating on further implications and future research.

Background

The science, technology and innovation perspective towards SDGs

STI and the interaction between different actors are core drivers of economic growth (Dosi et al., 2006; Freeman, 2004; Nelson & Winter, 1982). Increases in scientific and technological knowledge production act as the key source of innovation and competitive advantage (Pavitt, 1991). This has been central to our understanding of the competitiveness of nations and of firms. The centrality of productivity, and of its increase to sustain the long-term competitiveness of nations, has been a vital paradigm of economic policy for decades. This also explains why much of the literature on the STI process focuses on innovation outcomes (Kahn, 2018). Within this literature, much of the focus is centred on scientific work and research and development within the innovation system. These are seen as vehicles to enable job creation, firm performance and ultimately increases in gross domestic product (Fukuda, 2020; Goos et al., 2015; Klomp & Van Leeuwen, 2001).

However, an ongoing debate focuses on extending our attention beyond productivity or gross domestic product to other impact measures (Stiglitz et al., 2018). Global challenges like the climate crisis have strengthened these developments and led governments and businesses to reassess the role of pure productivity as a goal. The climate crisis has strengthened the call for additional outcome measures for innovation system activities in the public sector, often referred to as the Beyond GDP measurement framework (Hayden, 2021; Malay, 2019, 2021; Schreyer, 2021). This has been clear in the broadening of public impact assessment (Nieminen & Hyytinen, 2015). Policy has also actively discussed grand challenges and the role of governments in facilitating transitions that are unlikely to happen through other means but create significant overall benefits (e.g. Mazzucato, 2011). We have also seen significant shifts in company leaders' positions on the role of companies in grand challenges. The call by large companies’ CEOs to extend firms’ objectives beyond shareholder value (Gelles & Yaffe-Bellany, 2019) can be considered a major transition beyond the current paradigm towards a sustainable economy.

In this transition, the work of the United Nations on the creation of the SDGs has been central. The SDGs offer one of the first holistic taxonomies of grand challenges. With the advent of the SDG framework, as well as a shift in thinking, the innovation system has had to undergo transformative changes (Schot & Steinmueller, 2018). This is again mainly discussed in the policy domain, but the literature also examines the role of industry and innovation activities in relation to the SDGs. There is a need to adjust all aspects of economic, governance, and public policy at all levels if science, technology, and innovation are to reorient towards the SDG agenda (Walsh et al., 2020).

It is clear that the SDG framework calls for broad changes in technologies, policies and innovation (Leach et al., 2012). The sustainability transition expects that the socio-technical regime will realign to produce a totally different type of value as compared to the current regime (Schot & Steinmueller, 2018). Hajer et al. (2015) call for deeper integration of “planetary boundaries”, “safe and just operating space”, “energetic society” and “green competition” to realize the sustainability transition described through the SDGs. Many argue that it is the private sector that is apt to respond to the transformation (Scheyvens et al., 2016) by taking advantage of new technological solutions (Sinha et al., 2020), such as Industry 4.0 (Bonilla et al., 2018). We have seen evidence that the SDG framework has been used as a policymaking tool and frame for corporate innovation, partnerships, and strategy development (Sullivan et al., 2018).

The relevance of mapping and monitoring STI interaction and development is crucial for understanding the dynamics behind the innovative performance, growth and competitiveness of nations and even firms. Indicators signalling interactions between scientific and technological activities are highly relevant in this respect (Ranaei et al., 2017). Researchers and policymakers have for several decades recognized patents as valid and reliable indicators of technology development and innovation (Callaert et al., 2014). Patent documents contain essential research results that are valuable to the industry, business, and policymaking communities. If carefully analysed, they can show technological details and relations, reveal business trends, inspire novel industrial solutions, or help shape investment policy (Campbell, 1983). In addition, patent data is an essential source of information for companies, STI policymakers and other stakeholders. Analysis of patent data can essentially enrich the scope and depth of strategic technology policymaking and inform the alignment of a company's innovation strategies, the evaluation of R&D proposals and the assessment of technology competitiveness (Ena, 2021).

Considering patent documents, we estimate the impacts of industrial activity on the SDGs. This requires the development of practical proxy measures to establish a measurement of corporate activities’ impact on the goals. For example, approaches to measuring the societal impact of innovation have been developed in the context of frugal innovation (Altgilbers et al., 2020). In the SDG context, van der Waal et al. (2021) used patents to explore the impact of firms on the SDGs. Similar to Xie and Miyazaki (2013), the authors extend the established patent classification-based analysis of Migotto and Haščič (2015), e.g. for green technologies, to the content of patent applications. This requires building a taxonomy of terms to be identified from the patent text, which subsequently informs on the SDG relevance of the document.

However, the literature has not shown practical approaches for creating a proxy measure for large-scale analysis of STI impact on the SDGs. Our study attempts to measure the SDG relevance of intellectual property (IP) documents such as patents. The approach is based on classifying scientific documents according to their relevance to the SDGs. While patent documents differ from scientific documents, previous research has shown that natural language processing and machine learning can model the interaction between science and technology documents (Ranaei et al., 2017). The approach relies on an existing classification of scientific publications’ relevance to the SDGs as a gold standard, trains a model with scientific publications and transfers the model to patent data to create a classification of patent relevance. We demonstrate the approach on the European Patent Office patent families of 2020 and explore the classification for SDG 7 (clean energy) in depth.

Machine learning-based classification of science and patent documents

Machine learning-based classification of science and patent documents is an important avenue to complement existing human-assigned classifications (Suominen & Toivanen, 2015). It has been shown that topic modeling can be an effective approach to identifying important scientific and technological (S&T) documents (Yau et al., 2014). Overall, automated text classification is essential for knowledge management and S&T studies (Nedjah et al., 2009). Classifications have been seen to enable novel ways of mapping science (Suominen & Toivanen, 2015) and business capabilities (Suominen et al., 2016). Text classification is not merely the application of an algorithm across a text corpus but rather a process. Classifying text includes several phases, including preprocessing, document modeling (e.g., vector-space representation), feature selection, and algorithm utilization for model building and evaluation (Mirończuk & Protasiewicz, 2018). An important aspect of the process is the selection of the algorithms implemented, namely the choice between a supervised and an unsupervised approach (Ranaei et al., 2019).

Focusing particularly on classification methods in the automated classification of S&T documents, Ranaei et al. (2019) review the use of supervised and unsupervised classification methods in the S&T literature. For supervised learning, an approach reliant on labelled training data, the review lists six major methods used in the literature. The most commonly used supervised method to classify S&T-related text is the Support Vector Machine (SVM). For example, Kreuchauff and Korzinov (2017) applied an SVM approach to patent data and found that it allowed them to reduce expert bias and the inherent issues of a citation-based approach. In Kenekayoro (2018), SVM was used to accurately identify named entities from academic biographies, enabling a model for classifying information on researchers. The second most often used approach is Naïve Bayes. This is used by Lee and Lee (2019) to identify technological opportunities using patent forward citations. Wang et al. (2019) used the Naïve Bayes approach to predict the success of academic articles. Compared with the other methods they employed, namely KNN and Random Forest, the Naïve Bayes approach showed consistent performance.

In addition to the two most commonly used approaches, S&T literature classification has also been done using neural networks, K-Nearest Neighbors, Logistic Regression and supervised fuzzy algorithms (Ranaei et al., 2019). Similar to the ones previously mentioned, these methods are employed to answer S&T-relevant questions. For example, neural networks have been used to predict academic authors’ success (Mistele et al., 2019) and firms’ patent citation impact (Chen & Chang, 2010). In many cases, however, multiple classifiers are used to tackle the research question. This is the case, for example, in Wang et al. (2019), where the study considered Naïve Bayes, K-Nearest Neighbors, SVM, and Logistic Regression approaches.

For unsupervised methods, which rely on the algorithm's own framework rather than labelled data, Ranaei et al. (2019) highlighted eight commonly used approaches: Principal Component Analysis (PCA), Latent Semantic Indexing (LSI), Latent Dirichlet Allocation (LDA), K-means, unsupervised fuzzy algorithms, Artificial Neural Networks (ANN), hierarchical clustering, and probabilistic LSI. For example, Kenekayoro et al. (2015) used PCA on research groups’ website key-phrase vectors to cluster similar research. Zhou et al. (2021) used PCA on patent data to identify structures that could indicate clusters of emerging technology, with a particular focus on the identification of outlier patents. The work of Yau et al. (2014) extensively covers different topic modeling approaches, including LDA; they tested several topic modeling approaches for scientific publication clustering and showed the potential of methods such as LDA in producing meaningful research clusters. LDA has since been used for science mapping (Suominen & Toivanen, 2015), clustering themes in emerging technologies (Ranaei et al., 2020) and analysing firms’ behaviour through patents (Suominen et al., 2016). ANN is also a recurring method in S&T classification tasks; for example, Lu et al. (2020) used a multi-layer perceptron as a layer in their approach to classify and find similar patents. In addition, Yoon and Phaal (2013) used a K-means approach to classify patent documents in the context of technology roadmapping, and Yoon et al. (2010) used hierarchical clustering to develop meaningful research and development clusters.

The performance of text classifiers is not simply determined by the number of training examples or the quality of the training set; it may also be influenced by intrinsic characteristics of the texts being analysed, such as vocabulary size and document length (Figueroa & Zeng-Treitler, 2013). In a comparative study of classification methods’ performance on social media data, the authors caution that results may differ for other sources, languages, and classification objectives (Hartmann et al., 2019). The study by Ruijie et al. (2021) on patent text modeling strategies and classification shows that structural features, sample size, and dataset imbalance are vital considerations for exploiting the full advantages of Word2vec and deep learning algorithms. As a result of such sensitivity to context, data type, sample size and diversity, quality of labelled data, and notably the classification application, the majority of studies have benchmarked various methods to find the best performer in their niche application.

Meanwhile, certain methods are frequently used and benchmarked in text classification challenges, and although some reportedly perform better than others, the differences in performance are often not significant. In a text classification experiment on BBC news, Shah et al. (2020) showed that a Logistic Regression classifier with TF-IDF features attains the highest accuracy in a comparative analysis against Random Forest and KNN models. The findings of another study also indicated that the Logistic Regression multi-class classification method for product reviews achieved the highest classification accuracy in comparison with the Naïve Bayes, Random Forest, Decision Tree, and Support Vector Machine classification methods (Pranckevičius & Marcinkevičius, 2017).

Considering the context of our study, its objectives, the data types, and the availability and quality of labelled data, we proceed with a research design that experiments with a combination of methods. We employ a set of methods covering both feature design and classification so that performance can be observed across methods. The following section elaborates on our methodological choices, from data collection to machine learning classification and its evaluation.

Methodology

Query crafting and data collection

With the overall aim of identifying SDG-related science, technology and innovation, the study design is grounded on creating a lexical query from an iteratively developed database of SDG terminology (Fig. 1). A keyword search was applied to map scientific publications during the period 2015–2020. Our decision to include papers from 2015 onwards is motivated by the fact that in September 2015 the General Assembly adopted the 2030 Agenda for the SDGs (United Nations Department of Public Information, 2015). This period therefore encompasses the most recent activities, at least partly shaped by the SDG agenda. The outcome of this process is a set of research publications related to the SDG focus. Patent documents, by contrast, did not yield representative results for SDG relevance when queried with the same SDG lexical searches, owing to the nature of their content and descriptions. Since the publication data did yield representative results, we extrapolate the machine-learning model trained on publication data to detect SDG-relevant patents.

Fig. 1 Workflow process of identifying and retrieving SDG related publications

A detailed taxonomy was developed for mapping SDG-relevant publications. As a starting point, we utilized the taxonomy curated by Scopus SciValFootnote 4 and its effort to compile SDG queries in “Identifying research supporting the United Nations Sustainable Development Goals”. The process of curating the lexical keywords was complemented by analysing the UN Sustainable Development Goals documents (UN, 2015). From a semantic perspective, each word or concept was expanded to lexically similar words. In addition, the extracted list of keywords was matched with existing taxonomies (Elsevier, 2015; Jia et al., 2019; UNSDG, 2019; Vatananan-Thesenvitz et al., 2019). Using the gathered keywords, queries were compiled for each SDG and then searched in the Scopus publication database. Scopus is the largest abstract and citation database of peer-reviewed literature, covering books, scientific journals, and conference proceedings. Compared to other scientific databases such as the Web of Science, Scopus has broader coverage and is widely used to create datasets for systematic reviews of research (Mongeon & Paul-Hus, 2016). The resulting records were manually validated for their relevance to the corresponding SDG. Figure 2 illustrates the number of publications within each SDG category per year (2015–2020), based on the Scopus/SciVal queries.
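For illustration, the compiled queries follow Scopus’ advanced-search syntax; a heavily abbreviated, hypothetical fragment for SDG 7 might look like the following (the terms shown are placeholders, not the taxonomy actually used):

TITLE-ABS-KEY ( "renewable energy" OR "solar power" OR "energy efficiency" OR "energy access" ) AND PUBYEAR > 2014 AND PUBYEAR < 2021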

Fig. 2 Top: count of publications per SDG category for 2015–2020 based on the lexical query searched on Scopus. Bottom: SDG category publication volume across the top 10% citation percentile spectrum

The bibliometric data for the resulting publications were extracted for each SDG category. For each category, we downloaded the top 2000 publications based on the Scopus “Relevance Score”.Footnote 5 Within the bibliometric data, we extracted the textual content of the publications (title, abstract and keywords), which was used to train a model for automated detection of unseen SDG-related documents. Appendix 1 offers descriptive statistics of the labelled publication data.

Text modeling and machine learning classification

This research benefits from the development of machine learning classification algorithms for text to facilitate the construction of the SDG detection model. Text classification is one of the research hotspots in the field of Natural Language Processing (NLP). Originating in computer science and evolving from pattern recognition, the automated categorization (or classification) of objects such as text has become a growing area for the application of machine learning. Additionally, the increasing availability of documents in digital form has led to the development of methods to facilitate comprehending the essence and insight of documents in new ways (Sebastiani, 2001). Text classification by machine learning algorithms allows a large amount of text to be processed automatically and better insights on the documents to be created. Approaches for the algorithmic classification of documents can be based on several methods, mainly supervised, unsupervised or reinforcement learning. The difference between the approaches generally lies in the use of human-tagged data, reliance on the algorithm's own framework, or learning by rewards from correct actions. Within this study, the SDG-related publications’ content identified in the previous step is utilized as a training set for the classification algorithm.

The Python programming language was used to handle the data structuring and ML model building.Footnote 6 There are several classification methods within supervised learning for performing multi-class text classification. We perform a comparison among the most frequently used methods to select the best performing model for our classification task. In the process of validating the accuracy and reliability of the model, we use a test set (unseen publications) to confirm the usability of the ML model. The training set was set to 70% of the data and the test set to 30%. Figure 3 illustrates the workflow of the methodological steps.
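As a minimal sketch of this split, assuming the labelled publication texts and their SDG classes are held in two Python lists (variable names, the stratification and the seed are illustrative assumptions, not reported settings):

from sklearn.model_selection import train_test_split

# `texts`: title/abstract/keyword strings, `labels`: the SDG class of each publication
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels,
    test_size=0.30,     # 70% training, 30% test, as described above
    random_state=42,    # illustrative seed for reproducibility
    stratify=labels     # assumption: keep SDG class proportions similar in both splits
)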

Fig. 3 Workflow for the sustainable development goals (SDG) mapping of patents

Before applying the classification algorithms to the SDG-labelled publications, preprocessing was applied to the texts. The preprocessing involved cleaning procedures (e.g. stop word removal, non-alphanumeric character removal, stemming and lowercase transformation) applied to harmonize and increase the consistency of the text.Footnote 7 We considered the Naive Bayes Classifier for Multinomial Models, Linear Support Vector Machine and Logistic Regression as classifiers, and TF-IDF, Word2vec and Doc2vec as text modeling strategies.Footnote 8
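A minimal sketch of such cleaning, assuming NLTK is available (the exact cleaning steps and stop-word list used in the study may differ):

import re
from nltk.corpus import stopwords          # requires nltk.download("stopwords")
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
stop_words = set(stopwords.words("english"))

def preprocess(text):
    text = text.lower()                                          # lowercase transformation
    text = re.sub(r"[^a-z0-9\s]", " ", text)                     # remove non-alphanumeric characters
    tokens = [t for t in text.split() if t not in stop_words]    # stop word removal
    return " ".join(stemmer.stem(t) for t in tokens)             # stemming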

Text modeling

TF-IDF

For Natural Language Processing (NLP) to work, natural language (text) must be transformed into numerical vector form. Text vectorization techniques, such as Bag of Words and TF-IDF, are popular choices for machine learning algorithms and convert text to numeric feature vectors. Therefore, to quantify text and convert documents into a numerical representation, (1) we assign a fixed integer id to each word occurring in any document of the training set and (2) for each document i, we count the number of occurrences of each word w and store it in X[i, j] as the value of feature j, where j is the index of word w in the vocabulary. In this representation the number of features n equals the number of distinct words in the corpus, which can still be handled with today’s computational capacity. These high-dimensional sparse datasets are built with the scikit-learn built-in function “CountVectorizer”, which supports counts of N-grams of words or consecutive characters. Text preprocessing, tokenizing and filtering of stop words are all included in CountVectorizer, which builds a dictionary of features and transforms documents to feature vectors.

An occurrence count is a good start, but it has an issue: longer documents will have higher average count values than shorter documents, even though they might discuss the same topics. This is addressed by downscaling, i.e. computing for each phrase a weight that signifies the importance of the phrase in the document and corpus. The TF-IDF method is a widely used technique in Information Retrieval and Text Mining (Manning et al., 2012). Term Frequency–Inverse Document Frequency (TF-IDF) is a weighting procedure that tries to evaluate the relevance of terms in the document corpus. As the name implies, TF-IDF calculates a value for each phrase in a document through an inverse proportion of the frequency of the phrase in a particular document to the percentage of documents the phrase appears in. The formula below was used to compute the TF-IDF:

$${w}_{jk}=t{f}_{jk}\times id{f}_{j}$$
(1)

\({w}_{jk}=\) phrase weight of phrase \(j\) in document \(k\).

$$id{f}_{j}={\mathrm{log}}_{2}\left(\frac{n}{d{f}_{j}}\right)$$
(2)

\(t{f}_{jk}=\) the number of phrases \(j\) that occur in document \(k; n=\) the total number of documents in the document set; \(d{f}_{j}=\) the number of documents containing the phrase \(j\) in the document set.

Both tf and tf–idf can be computed using TfidfTransformer from scikit-learn 1.0.2. With the feature implementation explained here, we perform the classification task with the Naive Bayes Classifier for Multinomial Models, Linear Support Vector Machine and Logistic Regression. We utilized the scikit-learn 1.0.2 Pipeline class, which behaves like a compound classifier, including CountVectorizer and TfidfTransformer.
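A minimal sketch of this feature step (variable names are illustrative; the classifiers described in the following subsections are attached to the same pipeline structure):

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

count_vect = CountVectorizer()                        # tokenizing, counting and stop-word filtering
X_train_counts = count_vect.fit_transform(X_train)    # sparse document-term count matrix
tfidf = TfidfTransformer()
X_train_tfidf = tfidf.fit_transform(X_train_counts)   # counts reweighted into TF-IDF features
X_test_tfidf = tfidf.transform(count_vect.transform(X_test))  # same transformation for the test set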

Word2vec

The Word2vec model and application by Mikolov et al. (2013) have attracted significant attention in recent years. The vector representations of words learned by Word2vec models have been shown to carry semantic meaning and are helpful in various NLP tasks. In order to make the learning algorithm more efficient and reduce the risk of over-fitting, we seek a way to decrease dimensionality. The model is based on a neural network designed to reconstruct the linguistic contexts of words. Specifically, Word2vec takes a large corpus of text and produces a vector space model in which words that share common contexts in the corpus (i.e., nearby words) are positioned close to each other (Mikolov et al., 2013).

There are two different approaches to Word2vec: continuous bag-of-words (CBOW) and continuous skip-gram (Rong, 2014). The CBOW architecture predicts the current word based on a window of surrounding context words, while the skip-gram model uses the current word to predict the surrounding window of context words. Previous studies show that, although slower, skip-gram is more accurate and reliable for infrequent words than CBOW (Goldberg & Levy, 2014). This study adopts the skip-gram model considering that the distribution of terms in our corpus is highly skewed, and infrequent terms can provide valuable insights into potential technology opportunities. Figure 4 is a simple illustration of a Word2vec model showing the transformation of a document, where a vector is built for each word in the document. Having all the word vectors portrayed in a vector space, a distance measure can identify words similar to each other.

Fig. 4 Illustrative representation of Word2vec with a sample of the skip-gram model projected for country and capital vectors, after Mikolov et al. (2013)

Figure 4 presents part of the results of the skip-gram model trained on country-capital city data (Mikolov et al., 2013). The model organizes concepts (i.e., country and capital city) and examines their relationships without any supplementary information. For instance, countries such as France and Germany are located nearby in the resulting vector space, and the distance between France and Paris corresponds with that of Germany and Berlin.

Doc2vec is a generalization of Word2vec, with the distinction that it summarizes the text contained within whole documents rather than individual words. It is a straightforward extension of the Word2vec model, which was developed to represent words meaningfully in a vector space (to provide “word embeddings”) (Demeester et al., 2016; Mikolov et al., 2013). The objective of Word2vec is to situate words that have similar meanings close to one another. Similarly, Doc2vec aims to situate similar documents close to one another by placing document vectors (Docvec) close to each other in vector space. To do this, the algorithm uses the “context” around each term in the document to derive a vector representation that maximizes the probability of its appearance.
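A minimal Gensim sketch of this idea (the parameter values are illustrative assumptions; the study's exact Doc2vec settings are not reported here):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Each publication becomes a TaggedDocument of its tokens plus a unique tag.
tagged = [TaggedDocument(words=text.split(), tags=[i]) for i, text in enumerate(X_train)]
d2v = Doc2Vec(tagged, vector_size=300, window=5, min_count=2, epochs=20)  # illustrative settings
new_doc_vector = d2v.infer_vector("solar photovoltaic energy storage".split())  # vector for unseen text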

For operationalization, we benefited from Python’s Gensim Word2vec implementation, as it provides an efficient implementation of the continuous bag-of-words and skip-gram architectures for computing vector representations of words. The tool takes a text corpus as input and produces word vectors as output: it first constructs a vocabulary from the training text data and then learns vector representations of words. We used the Google News corpus (3 billion running words) word vector model as a pre-trained model (3 million 300-dimension English word vectors).
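A sketch of this operationalization, assuming the pre-trained GoogleNews vectors are available locally; the file name and the simple vector-averaging step for building document vectors are illustrative assumptions rather than reported details:

import numpy as np
from gensim.models import KeyedVectors

# Pre-trained 300-dimension Google News word vectors (path is illustrative).
w2v = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

def document_vector(text):
    # Average the vectors of in-vocabulary words to represent a document.
    words = [w for w in text.split() if w in w2v]
    return np.mean([w2v[w] for w in words], axis=0) if words else np.zeros(300)

X_train_w2v = np.vstack([document_vector(t) for t in X_train])
X_test_w2v = np.vstack([document_vector(t) for t in X_test])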

Machine learning classification

Naive Bayes classifier for multinomial models

Naive Bayes is one of the algorithms used in our research to train a machine learning classification model. Naive Bayes is a probabilistic model which works well on text categorization (Weikum, 2002). Naive Bayes classifiers are built on Bayesian classification methods, which rely on Bayes' theorem, an equation describing the relationship between conditional probabilities of statistical quantities. “Naive Bayes classifier” is a general term that refers to the conditional independence of each of the features in the model, whereas the multinomial Naive Bayes classifier is a specific instance that uses a multinomial distribution for each feature. Manning et al. (2012) describe the multinomial Naive Bayes classifier as suitable for classification with discrete features (e.g., word counts for text classification). The multinomial distribution normally requires integer feature counts; however, in practice, fractional counts such as TF-IDF may also work. The probability calculation is described in Eq. 3:

$$P(c\mid d)\propto P(c)\prod_{1\le k\le {n}_{d}} P\left({t}_{k}\mid c\right)$$
(3)

where \(P(t_k\mid c)\) is the conditional probability of the word \(t_k\) appearing in a document of class \(c\), i.e. the likelihood of \(t_k\) in class \(c\), and \(P(c)\) is the prior probability of a document belonging to class \(c\). To determine the class, the posterior probabilities are compared and the class with the largest posterior probability is chosen as the predicted result. The prior probability formula can be seen in Eq. 4:

$$P(c)=\frac{{N}_{c}}{N}$$
(4)

where \(N_c\) is the number of documents in category \(c\) and \(N\) is the total number of documents over all categories. The formula for the likelihood probability can be seen in Eq. 5:

$$P\left(t_{k}\mid c\right)=\frac{T_{t_{k}c}}{\sum_{t^{\prime}\in V} T_{t^{\prime}c}}$$
(5)

where \(T_{t_{k}c}\) is the number of occurrences of the word \(t_k\) in documents of class \(c\), and \(\sum_{t^{\prime}\in V} T_{t^{\prime}c}\) is the total number of occurrences of all words in class \(c\).

For operationalizing this method, we used the sklearn 1.0.2 “sklearn.naive_bayes.MultinomialNB” package featured in Python 3.8.5.
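A corresponding sketch of the classifier step, fitted on the TF-IDF features built in the previous section (an illustration of the stated package, not the exact configuration used):

from sklearn.naive_bayes import MultinomialNB

nb_clf = MultinomialNB()                # default Laplace smoothing (alpha=1.0) is an assumed setting
nb_clf.fit(X_train_tfidf, y_train)      # fit on TF-IDF features and SDG labels
nb_predicted = nb_clf.predict(X_test_tfidf)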

Linear support vector machine

Support vector machines (SVM) are supervised machine learning algorithms that are very effective for classification problems (Awad & Khanna, 2015). They are primarily classifiers that separate cases belonging to different categories by constructing hyperplanes (Ingo & Andreas, 2008). In our application, the SVM classifier is first trained on a set of publications tagged with their SDG category so that the machine gains the knowledge needed for effective categorization.

In the following, we briefly describe its core concept and discuss some advantages relevant to the problem at hand. Simply put, the core idea of the method is to create a unique discrimination profile (represented by a linear function) between samples from different classes. An iterative training algorithm is applied to define a hyperplane that effectively separates two or multiple classes. During the training phase, a support vector machine partitions a high-dimensional space based on the points it contains that belong to known classes.

The ‘‘touching’’ data points are termed support vectors; in fact, the resulting separation plane is shaped only by these constraining (= supporting) points. Below, we provide the mathematical notation of a support vector machine following Hsu et al. (2010), a comprehensive introduction to the method for purposes such as ours. Formally defined, we have a training set \(\left({x}_{i},{y}_{i}\right)\), \(i=1,\dots ,l\), of sample points (here: our publications), where every \({x}_{i}\in {R}^{n}\) is an attribute vector (consisting of our normalized word and n-gram frequencies) and \({y}_{i}\in \{-1,1\}\) is a decision for that specific data point which thus defines its class. The SVM then yields the solution to the following optimization problem (see as well Boser et al. 1992; Guyon et al. 1993):

$$ \begin{array}{*{20}l} {\;\;\;\;\mathop {\min }\limits_{w,b,\xi } \frac{1}{2}w^{{\text{T}}} w + C\mathop \sum \limits_{i = 1}^{l} \xi_{i} } \hfill \\ {{\text{s.t.}}\;y_{i} \left( {w^{{\text{T}}} \Phi \left( {x_{i} } \right) + b} \right) \ge 1 - \xi_{i} } \hfill \\ {\;\;\;\;\;\;\;\;\;\;\;\;\;\;\xi_{i} \ge 0} \hfill \\ \end{array} $$
(6)

in which \(w\) is the normal vector between the separating hyperplane and the parallel planes spanned by the support vectors. The mapping \(\Phi\) is related to so-called kernel functions, such that \(K\left({x}_{i},{x}_{j}\right)\equiv\Phi {\left({x}_{i}\right)}^{T}\Phi \left({x}_{j}\right).\) For problems in which the data under consideration are not linearly separable, \(\Phi\) maps the training attributes into a higher-dimensional space where a separating hyperplane may be found.

The above version of the classification procedure also incorporates the so-called soft-margin method (Cortes and Vapnik 1995), which allows for mislabelled training sample points. The approach introduces \({\xi }_{i}\) as nonnegative slack variables that measure the extent to which items in the training set are incorrectly classified. \({\sum }_{i=1}^{l} {\xi }_{i}\) is thus a penalty term, and C a penalty parameter, on which we comment later. For operationalizing this method, we used the sklearn 1.0.2 “sklearn.linear_model.SGDClassifier” package featured in Python 3.8.5.
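In scikit-learn, SGDClassifier with a hinge loss corresponds to a linear SVM trained by stochastic gradient descent; a minimal sketch on the TF-IDF features (hyperparameter values are illustrative assumptions):

from sklearn.linear_model import SGDClassifier

svm_clf = SGDClassifier(loss="hinge",    # hinge loss yields a linear SVM
                        alpha=1e-4,      # regularization strength (illustrative)
                        max_iter=1000,
                        random_state=42)
svm_clf.fit(X_train_tfidf, y_train)
svm_predicted = svm_clf.predict(X_test_tfidf)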

Logistic regression

Logistic regression is a machine learning classification algorithm used to predict the probability of a categorical dependent variable (Cramer, 2005). In binary logistic regression, the dependent variable contains data coded as 1 (yes, success, etc.) or 0 (no, failure, etc.); in other words, the logistic regression model predicts P(Y = 1) as a function of X.

Multinomial Logistic Regression (MLR) models the probability that a d-dimensional point \(x_i \in \{x_1, x_2, \dots, x_N\}\) belongs to a class \(k \in \{1, 2, \dots, K\}\). The data take the form \((x_i, y_i)\), \(i = 1, \dots, N\), where \(x_i \in {R}^{d}\) is a d-dimensional feature vector and \(y_i \in \{1, 2, \dots, K\}\) is the label connected with it; \(K\) is the number of class labels. Let \(y_{ik} = I(y_i = k)\) indicate the membership of data point \(x_i\) in class \(k\), and let \(W = \{w_1, w_2, \dots, w_K\}\) denote the parameter vectors for all \(K\) classes. The probability that \(x_i\) belongs to class \(k\) is given by:

$$p\left(y=k\mid x_i\right)=\frac{\exp\left(w_{k}^{T}x_i\right)}{\sum_{j=1}^{K} \exp\left(w_{j}^{T}x_i\right)}$$
(7)

For model building we used the built-in function in Python 3.8.5 Scikit-learn 1.0.2 “sklearn.linear_model.LogisticRegression”.
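A minimal sketch of this step with the stated package, shown here on the averaged Word2vec document vectors since this is the combination later selected as the best performer (solver and iteration settings are illustrative assumptions):

from sklearn.linear_model import LogisticRegression

lr_clf = LogisticRegression(multi_class="multinomial",  # softmax over the SDG classes, as in Eq. 7
                            solver="lbfgs",
                            max_iter=1000)
lr_clf.fit(X_train_w2v, y_train)            # X_train_w2v: averaged Word2vec document vectors
lr_predicted = lr_clf.predict(X_test_w2v)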

Model assessments metrics

We measure the performance of the models by precision (also called positive predictive value), which is the fraction of relevant instances among the retrieved instances, and recall (also known as sensitivity), which is the fraction of relevant instances that were retrieved. In an imbalanced classification problem with more than two classes, precision is calculated as the sum of true positives across all classes divided by the sum of true positives and false positives across all classes (Sokolova & Lapalme, 2009). The generalization from a two-class confusion matrix to the multiclass case is thus to sum over the rows or columns of the confusion matrix. Given that the matrix is oriented as in the two-class case, i.e., that a given row of the matrix corresponds to a specific value of the "truth", we have:

$${\mathrm{Precision}}_{i}=\frac{{M}_{ii}}{{\sum }_{j} {M}_{ji}}$$
$${\mathrm{Recall}}_{i}=\frac{{M}_{ii}}{{\sum }_{j} {M}_{ij}}$$
$$F=2\cdot \frac{\mathrm{precision}\cdot \mathrm{recall}}{\mathrm{precision}+\mathrm{recall}}$$

That is, precision is the fraction of events (M) where we correctly declared \(i\) out of all instances where the algorithm declared \(i\). Conversely, recall is the fraction of events where we correctly declared \(i\) out of all cases where the true state of the world is \(i\). Both precision and recall are therefore based on relevance, and the F1 score is the harmonic mean of precision and recall. For reporting the above-mentioned measures, we used the sklearn 1.0.2 “classification_report” feature in Python 3.8.5.
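A sketch of how these per-class measures are obtained (the classification_report call is the one named above; the manual computation from the confusion matrix mirrors the formulas, with rows as the true class and columns as the predicted class):

from sklearn.metrics import classification_report, confusion_matrix

print(classification_report(y_test, lr_predicted))   # precision, recall and F1 per SDG class
cm = confusion_matrix(y_test, lr_predicted)          # rows: true class, columns: predicted class
precision_i = cm.diagonal() / cm.sum(axis=0)         # column sums: everything declared as class i
recall_i = cm.diagonal() / cm.sum(axis=1)            # row sums: everything truly of class i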

Analysis and results

Machine learning model design and deployment

The described approach to identify the relatedness of patents to the SDGs is implemented in the Python programming language. At the initial stage, the textual content of the patents and publications was preprocessed with Python functions to clean irregular characters and to harmonize and tokenize the input texts. We performed a strict cleaning process regarding stop words and meaningless words, facilitated by visualization of the top word clusters to identify such words. Text preprocessing is traditionally essential for natural language processing (NLP) tasks: it cleans and normalizes the text so that machine learning algorithms can perform better. The phases of text preprocessing include stop word removal, stemming and lemmatization, which together aim to put all the text on a level playing field. The text is also tokenized, that is, split into smaller pieces, or “tokens”. We also remove noise from the initial text, such as extra whitespace and special characters, and lowercase all text. Finally, we convert the text documents into a numerical representation with either TF-IDF (term frequency–inverse document frequency) or another vectorizing technique, depending on the classification approach. After that, we train several classifiers using Python’s Scikit-Learn and Gensim libraries.

Once the features are generated from the text, the machine learning classifier can be trained with the labelled publication data collected earlier. Five different classification strategies were tested, and the performance of each model is reported in Table 1. The overall accuracy of the models did not exceed 60% (based on the first prediction of the model), which is not a high enough standard to accept a model as a good classifier of all the classes. Nevertheless, our model experiments deliver acceptable accuracy (above 60%) for most of the SDG classes, namely SDG 1, 2, 3, 4, 5, 6, 7, 9, 10, 13 and 16. This highlights that some of the SDGs seem to be very difficult to identify from text, while many of the classes are identified with reasonable accuracy; this ultimately limits the ability to identify some of the SDGs. Overall, based on the model comparison in Table 1, the highest overall accuracy (F-score) is achieved by the “Word2vec and logistic regression” model. This model is selected for further analysis, keeping in mind the limitations in identifying e.g. SDG 8, 14 and 15.

Table 1 Classification models performance comparison

The classifier performances reported in Table 1 vary among the SDG classes. Over the iterative training, testing and validating process, we tried to adopt the recommended measures to enhance classification performance, yet some classes suffer from very low accuracy. This has multiple reasons, the nature of which we try to elaborate on in this paper. Other reports and early studies also acknowledge unbalanced performance across SDG classes (see the performance of the SDG-BAI algorithm developed by OECD (2021a)). One initial cause lies in the definitions of the SDG classes: while some classes have a clear focus (e.g. SDG 7), others are described more broadly, making it difficult to isolate a specific artefact that addresses that particular class. For example, examining the SDG classification of Scopus publications by publication volume and citation percentile indicates that some categories are over-represented while others have only minor representation. Figure 2 shows the publication volume of SDG-relevant publications in Scopus together with the cohort positioning in the top citation percentiles; it indicates the disparity between SDG categories, with SDG 7 dominant in both size and citation percentile.

There have been experiments indicating similarity between categories, which reduces the classifier's ability to distinguish each class separately. Based on PatentsightFootnote 9 metrics, we also observed very low identification rates for some of the SDG classes that we likewise had difficulties isolating, such as SDG 15, 14 and 8. Nevertheless, our focus has been on expanding the identification capability for the SDGs where we had better performance, and we therefore conducted the case analysis on SDG 7, which was among our highest performing classes.

The classifier is used on patent data to evaluate patents’ relatedness to the SDGs. Patent data were collected from the Patbase service, limited to granted patent families from the year 2020; for these we retrieved the textual content (title and abstract), patent number, assignee and patent country code. This resulted in a dataset of 132,226 patent families produced by more than 37 thousand assignees. Appendix 1 offers descriptive statistics of the patent data.

The patent data were processed using the same preprocessing steps as the publication data. After passing the preprocessed texts to the ML model, we obtained the relevance of each patent document to each of the SDGs. The result is a probability distribution over the SDGs for each document. Keeping in mind the classifier's varying ability to predict relatedness to specific SDGs, we aggregated the probability score of each record for each SDG category. Figure 5 presents the SDGs addressed in the patent application texts, ordered from highest to lowest aggregate relevance.

Fig. 5 2020 EPO patents and their orientation towards SDG categories

Figure 5 indicates that more than 20% of patent families address SDG 7 in their textual content. By design, the ML model gives the affiliation of a patent text to each of the SDGs, which in practice also tells us which other SDG classes a patent text is close to. In Appendix 4, we illustrate the top three ML guesses for patents.
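A sketch of how such per-document SDG probabilities and top guesses can be obtained from the selected classifier (variable and label names are illustrative; X_patents_w2v stands for the averaged Word2vec vectors of the patent texts, built as for the publications):

import numpy as np

proba = lr_clf.predict_proba(X_patents_w2v)        # one probability per SDG class for each patent family
top3 = np.argsort(proba, axis=1)[:, ::-1][:, :3]   # the three most likely SDG classes per family
sdg7_index = list(lr_clf.classes_).index("SDG7")   # label name is an assumption
sdg7_mask = proba[:, sdg7_index] >= 0.85           # families above the 85% probability threshold used below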

In order to benchmark the ML model's ability to identify SDG relatedness, a lexical query was applied and evaluated. Based on the ML model results in terms of precision, recall and overall F-score, the Word2vec classifier was the most accurate in detecting SDG 7 relevance. Therefore, the SDG 7 categorized patents in our EP families set form a good candidate subset for benchmarking. We adopted the lexical query (Appendix 1) used for SDG 7 publications and used it to retrieve granted patents for 2020 from Patbase, resulting in a dataset of 1272 patent families.

Applying the ML model to detect the SDG 7 category in the 132 k patent sample results in almost 10 thousand patent families with a probability above 85% for SDG 7; lowering the probability threshold to 70% results in 15,500 patent families. Cross-referencing (1) the SDG 7 query-driven records and (2) the ML model-driven records for SDG 7 at a probability level of 85% or above, we used the International Patent Classification (IPC) to check the number of patent families included in each patent class. Figure 6 illustrates the difference in the number of patent families in the top shared IPC classes between the lexical query and the ML-detected patents.

Fig. 6 Difference of patent family counts in IPC classes between the lexical query and the ML-derived query

The two sets share more than 175 IPC classes (out of 380). While the proportional volume of patents in each IPC class differs between the two sets, the proportional difference within shared classes does not exceed 4%. This analysis suggests that the lexical-query patent families and the ML-classified patent families for SDG 7 largely overlap, although there is also some difference in coverage in terms of IPC classes.

Figure 7 illustrates the distribution of patent families over IPC classes for the ML model.

Fig. 7 Top IPC classes within 2020 EPO patent families classified as SDG 7 by the ML model

The ML model expanded the scope of IPC classes by 14%, with 75 new classes identified as containing patent families relevant to SDG 7. Conversely, 68 classes (14%) identified with the lexical search query were not identified within the ML model's top 30% of results (Fig. 8 is a visual illustration of the top IPC classes). However, further checks on the ML results at the 50% to 70% probability levels showed that most of these IPC classes are covered in that cohort. Relaxing the threshold at which a patent is associated with a specific SDG thus has an impact on the results.

Fig. 8 Top IPC classes within SDG 7 lexical query identified patent families

To learn the thematic structure of the ML model-driven patent families for SDG 7, we used Patbase keyword landscaping analysis, which relies on the most commonly occurring keywords (found within titles and abstracts). In contrast to the IPC class comparison, this view adds an angle on the thematic coverage of the patent families, with the families clustered by either the relevant keywords or the concepts. We shortlisted the keywords based on their occurrence in both the SDG 7 lexical query-driven results and the ML model-detected results. The keyword occurrences form a two-level hierarchy, so we picked the seven top clusters and their affiliated sub-clusters for visualization purposes. Figure 9 illustrates the keyword clusters in a network visualization with nodes as clusters and edges linking sub-clusters to the higher-level clusters.

Fig. 9 Network visualization of clusters affiliated to IPC classes

Keyword clusters newly identified by the ML model are marked green, while clusters overlapping between the two modes of SDG 7 identification remain blue. It is noticeable that the ML model was able to detect patent families that are thematically aligned with the lexical query-based method at the first cluster level, while at the same time introducing new sub-clusters. For example, in the figure we can identify technology clusters such as rechargeable batteries in the electronic devices sub-class that the ML model has identified as SDG 7 related. This identification is an extension for SDG 7, learned from SDG 7 related publications and now transferable to different artefacts (in our case, patents). The data file offers the patent families identified as SDG 7 by our ML method with a probability of over 80%. In addition, we have included the keyword clusters generated from the patents' IPC classes to give more information on the content of the patents and, furthermore, to be used as a lexical query for keyword searches.

Discussion and conclusion

The United Nations Sustainable Development Goals (SDGs) offer a framework to achieve a better and more sustainable future. Governments worldwide have already agreed to these goals; the emphasis is now on taking action at global as well as local levels. Scientific and technological innovations are necessary, but enabling them to make an impact requires an understanding of their utility for a sustainable economy. Our study contributes to systematically comprehending sustainability-oriented science, technology and innovation. First, it offers a systematic identification approach for sustainable development goals, requirements and objectives. Second, based on the publications with the highest relatedness to the SDGs, the study trains a machine learning model to detect the relatedness of scientific publications and, finally, uses the approach to analyse patent documents. This approach extends previous work on novel ways of identifying science and technology linkages (Ranaei et al., 2017) by overlaying the SDG context. While we can question the linearity of innovation (Suominen & Seppänen, 2014), understanding the co-evolution of science and technology is vital for the economy and society overall (Pavitt, 1991). Development intersects with IP policies, as creativity and innovation are either fostered or frustrated by an economy's chosen development policy. Therefore, including consideration of the SDGs in IP policy could lead to more significant and more lasting success.

Based on the UN's SDG definitions, we queried for relevant publications addressing 16 of the SDGs. This approach resulted in a vocabulary capturing the breadth and depth of the SDGs within scientific publications. The comprehensive taxonomy created for SDG identification was used to create labelled data to train a machine learning model. Benchmarking five classification methods identified Word2vec with Logistic Regression as the top performer in the multi-class classification task. The performance metrics were satisfactory for ten SDG categories. The highest performing model was then extended to unseen documents, in this case patents, which were assigned an SDG relevance at different probability levels. To test the validity of the ML model in detecting SDG contributions in patents, the classification was compared to the existing lexical queries using patent classifications. The analysis illustrated that the ML model increased the identification of SDG-oriented patent families among 2020 EPO patents by 14.5% at the IPC class level.

Our study demonstrates the application of machine learning for expanding lexical queries and information retrieval. For complex and multidimensional topics such as the SDGs, comprehending the breadth and depth of their definitions often becomes a challenge. Importantly, not all scientific and technological artefacts share the same features; patent documents, for example, due to their technical nature, do not directly correspond to an SDG definition. As the linkage between science and technology, and between publications and patents, is evident, we could utilize this connection and expand the identification of SDG-oriented artefacts from publications to patents.

The implications of our study are multi-faceted. Moving towards further contributions aligned with the SDGs requires an accurate understanding of historical STI efforts. The ability to identify STI artefacts oriented towards the SDGs can give an accurate picture of domains where progress has been made, where challenges remain, and where efforts have been undermined. Therefore, on a macro level the study can provide an overview of the STI contributions to the SDGs so far. On a micro level, it guides companies in aligning their strategies and in measuring and managing their contribution to the realization of the SDGs through their IP strategies.

The policy relevance of our results comes from the call for measurement frameworks beyond productivity and gross domestic product (Hayden, 2021; Malay, 2021; Schreyer, 2021). There have been calls from policymakers to create a measurement framework enabling analysis of the sustainable transition. Currently available Beyond GDP measures seem to have low potential for capturing the transformation (Malay, 2019). Malay (2021) particularly called for the creation of cohesive measurement frameworks built around a shared goal, and the SDGs are a good representation of this type of shared goal. A practical example of the policy relevance of the Beyond GDP agenda is the European Commission initiative to develop indicators that are clear, transformative and more inclusive of environmental and social aspects.Footnote 10 This has led, for example, to the European Commission calling for the development of a Beyond GDP measurement framework.Footnote 11

There are of course limitations to this research. First, we analysed an extensive set of labelled publication data and its applicability, via an ML model, to detecting SDG relevance in patent data. Despite this exercise, the results may differ for other types of text sources, languages, and classification objectives. For example, the OECD's algorithm for detecting SDG relevance in text has been trained with 6,000 labelled descriptions of firms' actions (OECD, 2021b), and the difference in the performance of their SDG identification algorithm is noticeable because the nature of the training data differs from ours. Second, our work departed from the patent and publication merging approach of Ranaei et al. (2017), which also framed the approach in our study. That being said, employing algorithms and methods beyond the ones we study, such as deep learning approaches, could have offered an additional vantage point. However, we believe we cover a representative set of both machine learning and lexical methods and have demonstrated their functionality in our classification task.

Third, we also follow prior comparisons by applying standard procedures in terms of preprocessing and document representation. There are many ways of extending this to the specific task at hand, which can improve performance. Overall, our results can be viewed as a lower performance boundary which further emphasizes the potential of automated text classification in discovering STI and SDG connections. Still, an even more extensive optimization may produce different results. We hope these findings make sound automated text classification more approachable to STI researchers and encourage future research to integrate novel approaches for complex classification tasks such as SDGs.